Lab 1 - Data Exploration and Analysis -- 2013/2014 CitiBike-NYC Data

Michael Smith, Alex Frye, Chris Boomhower ----- 1/29/2017

Image courtesy of http://newyorkeronthetown.com/, 2017

Business Understanding

Describe the purpose of the data set you selected

The data set selected by our group for this lab primarily consists of Citi Bike trip history collected and released by NYC Bike Share, LLC and Jersey Bike Share, LLC under Citi Bike's NYCBS Data Use Policy. Citi Bike is America's largest bike share program, with 10,000 bikes and 600 stations across Manhattan, Brooklyn, Queens, and Jersey City... 55 neighborhoods in all. As such, our data set's trip history includes all rental transactions conducted within the NYC Citi Bike system from July 1st, 2013 to February 28th, 2014. These transactions amount to 5,562,293 trips within this time frame. The original data set includes 15 attributes, from which our team was able to derive 10 more, as discussed in detail in the next section. Of particular note, we also merged NYC weather data from the Carbon Dioxide Information Analysis Center (CDIAC) with the Citi Bike data to provide environmental insights into rental behavior.

The trip data was collected via Citi Bike's check-in/check-out system among 330 of its stations in the NYC system as part of its transaction history log. While the non-publicized data likely includes further particulars such as rider payment details, the publicized data is anonymized to protect rider identity while simultaneously offering bike share transportation insights to urban developers, engineers, academics, statisticians, and other interested parties. The CDIAC data, however, was collected by the Department of Energy's Oak Ridge National Laboratory for research into global climate change. While basic weather conditions are recorded by CDIAC, as included in our fully merged data set, the organization also measures atmospheric carbon dioxide and other radiatively active gas levels to conduct their research efforts.

Our team has taken particular interest in this data set because some of our team members enjoy both recreational and commute cycling. By combining basic weather data with Citi Bike's trip data, we expect to be able to predict whether riders are more likely to be (or become) Citi Bike subscribers based on environmental conditions during the ride, the day of the week of the trip, trip start and end locations, the general time of day (i.e. morning, midday, afternoon, evening, night) of the trip, the rider's age and gender, etc. Deeper analysis may yield further insights, such as identifying gaps in station coverage. Furthermore, quantifiable predictions, such as a rider's age as a function of trip distance and duration given other factors, would improve the targeting of bike share marketing efforts in New York City. Likewise, trip duration could be predicted from other attributes, which would allow the company to promote recreational cycling by adjusting factors within its control. By leveraging some of the vast number of trip observations as training data and others as test data via randomized selection, we expect to be able to measure the effectiveness of our algorithms and models throughout the semester.

Data Understanding

Describe the meaning and type of data

Before diving into each attribute in detail, one glaring facet of this data set that needs mentioning is its inherent time-series nature. By no means was this overlooked when we decided upon these particular data. To mitigate the effects of time on our analysis results, we have chosen to aggregate time-centric attributes such as dates and hours of the day by replacing them with simply the day of the week or period of the day (more on these details shortly). For example, by identifying trips occurring on July 1st, 2013, not by the date of occurrence but rather the day of the week, Monday, and identifying trips on July 2nd, 2013, as occurring on Tuesday, we will be able to obtain a "big picture" understanding of trends by day of the week instead of at the date-by-date level. We understand this is not a perfect solution, since the time-series component is still an underlying factor in trip activity, but it is good enough to answer the types of questions we hope to target, as described in the previous section, because we will be comparing all Mondays against all Tuesdays, etc.

As mentioned previously, the original data set from Citi Bike included 15 attributes. These 15 attributes and associated descriptions are provided below:

  1. tripduration - Integer - The total time (in seconds) a bike remains checked out, beginning with the start time and ending with the stop time
  2. starttime - DateTime - The date and time at which a bike was checked out, marking the start of a trip (i.e. 2/12/2014 8:16)
  3. stoptime - DateTime - The date and time at which a bike was checked back in, marking the end of a trip (i.e. 2/12/2014 8:16)
  4. start_station_id - String - A categorical number value used to identify Citi Bike stations, in this case the station from which a bike is checked out
  5. start_station_name - String - The name of the station from which a bike is checked out; most often the name of an intersection (i.e. E 39 St & 2 Ave)
  6. start_station_latitude - Float - The latitude coordinate for the station from which a bike is checked out (i.e. 40.74780373)
  7. start_station_longitude - Float - The longitude coordinate for the station from which a bike is checked out (i.e. -73.9900262)
  8. end_station_id - String - A categorical number value used to identify Citi Bike stations, in this case the station in which a bike is checked in
  9. end_station_name - String - The name of the station at which a bike is checked in; most often the name of an intersection (i.e. E 39 St & 2 Ave)
  10. end_station_latitude - Float - The latitude coordinate for the station at which a bike is checked in (i.e. 40.74780373)
  11. end_station_longitude - Float - The longitude coordinate for the station at which a bike is checked in (i.e. -73.9900262)
  12. bikeid - String - A categorical number value used to identify a particular bike; each bike in the bike share network has its own unique number
  13. usertype - String - A classifier attribute identifying a rider as a bike share subscriber or a one-time customer (i.e. Subscriber vs. Customer)
  14. birth_year - Integer - The year a rider was born (Only available for subscribed riders, however)
  15. gender - String - A categorical number value representing a rider's gender (i.e. 0 = unknown, 1 = male, 2 = female)

It is important to note that birth year and gender details are not available for "Customer" user types but rather for "Subscriber" riders only. Fortunately, these are the only missing data values among all trips in the data set. Unfortunately, however, it means that we will not be able to identify the ratio of males-to-females that are not subscribed or use age to predict subscribers vs. non-subscribers (Customers). More to this end will be discussed in the next section.

It is also worth mentioning that while attributes such as trip duration, start and end stations, bike ID, and basic rider details were collected and shared with the general public, care was taken by Citi Bike to remove trips taken by staff during system service appointments and inspections, trips to or from "test" stations which were employed during the data set's timeframe, and trips lasting less than 60 seconds, which could indicate false checkouts or re-docking efforts during check-in.

Because some attributes may be deemed duplicates (i.e. start_station_id, start_station_name, and start_station_latitude/longitude all identify station locations), we chose to extract further attributes from the base attributes at hand. Further attributes were also extracted to mitigate the effects of time. In addition, we felt increased understanding could be obtained by merging in weather data for the various trips, as discussed in the previous section. These additional 10 attributes are described below:

  1. LinearDistance - Float - The distance (miles) from a start station to an end station (as the crow flies); calculated from the latitude/longitude coordinates of the start/end stations
  2. DayOfWeek - String - The day of the week a trip occurs regardless of time of day, month, etc.; extracted from the starttime attribute (i.e. Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday)
  3. TimeOfDay - String - The portion of the day during which a bike was checked out; extracted from the starttime attribute (i.e. Morning, Midday, Afternoon, Evening, Night)
  4. HolidayFlag - String - A categorical binary value used to identify whether the day a trip occurred was on a holiday or not; extracted from the starttime attribute (i.e. 0 = Non-Holiday, 1 = Holiday)
  5. Age - Integer - The age of a rider at the time of a trip; calculated based on the birth_year attribute (Since only birth year is included in original Citi Bike data set, exact age at time of trip when considering birth month is not possible)
  6. PRCP - Float - The total recorded rainfall in inches on the day of a trip; merged from the CDIAC weather data set
  7. SNOW - Float - The total recorded snowfall in inches on the day of a trip; merged from the CDIAC weather data set
  8. TAVE - Integer - The average temperature (degrees Fahrenheit) throughout the day on which a trip occurs; merged from the CDIAC weather data set
  9. TMAX - Integer - The maximum temperature (degrees Fahrenheit) on the day on which a trip occurs; merged from the CDIAC weather data set
  10. TMIN - Integer - The minimum temperature (degrees Fahrenheit) on the day on which a trip occurs; merged from the CDIAC weather data set
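As a quick illustration of how derivations like these can be computed, the sketch below builds DayOfWeek, TimeOfDay, and LinearDistance on a toy pandas frame. The data are hypothetical, and the haversine formula here stands in for the geodesic calculation used in our actual compile step:

```python
import numpy as np
import pandas as pd

# Toy trips frame; column names mirror the attribute list above (hypothetical values)
trips = pd.DataFrame({
    "starttime": pd.to_datetime(["2013-07-01 08:16", "2013-07-02 13:05", "2013-07-02 23:40"]),
    "start_station_latitude": [40.7478, 40.7370, 40.7434],
    "start_station_longitude": [-73.9900, -73.9901, -74.0000],
    "end_station_latitude": [40.6822, 40.7424, 40.7208],
    "end_station_longitude": [-73.9615, -73.9973, -73.9779],
})

# DayOfWeek comes straight from the timestamp
trips["DayOfWeek"] = trips["starttime"].dt.day_name()

# TimeOfDay buckets (Night wraps midnight, so np.select is simpler than ordered bins)
hour = trips["starttime"].dt.hour
trips["TimeOfDay"] = np.select(
    [hour.between(5, 9), hour.between(10, 13), hour.between(14, 16), hour.between(17, 21)],
    ["Morning", "Midday", "Afternoon", "Evening"],
    default="Night",
)

# Great-circle ("as the crow flies") distance in miles via the haversine formula
def haversine_miles(lat1, lon1, lat2, lon2, r=3958.8):
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 \
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

trips["LinearDistance"] = haversine_miles(
    trips["start_station_latitude"], trips["start_station_longitude"],
    trips["end_station_latitude"], trips["end_station_longitude"],
)
```

Because the Night bucket spans midnight (10 PM-5 AM), a condition list such as np.select's is more convenient here than a single ordered binning.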

After extracting our own attributes and merging weather data, the total number of attributes present in our final data set is 25. Only 15 are used throughout this lab, however, due to the duplicate nature of some attributes as discussed already. The final list of attributes used is tripduration, DayOfWeek, TimeOfDay, HolidayFlag, start_station_name, start_station_latitude, start_station_longitude, usertype, gender, Age, PRCP, SNOW, TAVE, TMAX, and TMIN.

Compiling Multiple Data Sources

To begin our analysis, we need to load the data from our source .csv files. Steps taken to pull data from the various source files are as follows:

  • For each file from CitiBike, we process each line, appending the manually computed columns [LinearDistance, DayOfWeek, TimeOfDay, & HolidayFlag].
  • Similarly, we load our weather data .csv file.
  • With both source file variables gathered, we append the weather data to our CitiBike data by matching on the date.
  • To avoid a two-hour run time on every execution, we write the final version of the data to .csv files. Each file holds 250,000 records to reduce file size for GitHub loads.
  • All above logic is skipped if the file "Compiled Data/dataset1.csv" already exists.
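As an aside, the per-date matching of weather records to trips described above could also be expressed as a single vectorized pandas merge, which scales far better than a nested loop. The sketch below uses hypothetical miniature frames; the column names are illustrative, not the exact compiled schema:

```python
import pandas as pd

# Hypothetical miniature versions of the two sources
trips = pd.DataFrame({
    "starttime": pd.to_datetime(["2013-07-01 08:16", "2013-07-02 17:40"]),
    "tripduration": [308, 1355],
})
weather = pd.DataFrame({
    "date": pd.to_datetime(["2013-07-01", "2013-07-02"]),
    "PRCP": [0.0, 0.21],
    "TAVE": [78, 74],
})

# Normalize each trip timestamp to a midnight date key, then a left merge
# attaches that day's weather to every trip starting on that day
trips["date"] = trips["starttime"].dt.normalize()
merged = trips.merge(weather, on="date", how="left")
```

A left merge keeps every trip even when a weather record is missing for its date, which makes any gaps in the weather file easy to spot afterward.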

Below you will see this process, as well as import/options for needed python modules throughout this analysis.

In [1]:
import os
import sys
import re
from geopy.distance import vincenty
import holidays
from datetime import datetime
from dateutil.parser import parse
import glob
import pandas as pd
import numpy as np
from IPython.display import display
import gmaps
import plotly as py
import plotly.graph_objs as go
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import statistics
from scipy.stats.stats import pearsonr

py.offline.init_notebook_mode()

pd.options.mode.chained_assignment = None

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
In [ ]:
starttime = datetime.now()
print(starttime)

if os.path.isfile("Compiled Data/dataset1.csv"):
    print("Found the File!")
else:
    citiBikeDataDirectory = "Citi Bike Data"
    citiBikeDataFileNames =[
        "2013-07 - Citi Bike trip data - 1.csv",
        "2013-07 - Citi Bike trip data - 2.csv",
        "2013-08 - Citi Bike trip data - 1.csv",
        "2013-08 - Citi Bike trip data - 2.csv",
        "2013-09 - Citi Bike trip data - 1.csv",
        "2013-09 - Citi Bike trip data - 2.csv",
        "2013-10 - Citi Bike trip data - 1.csv",
        "2013-10 - Citi Bike trip data - 2.csv",
        "2013-11 - Citi Bike trip data - 1.csv",
        "2013-11 - Citi Bike trip data - 2.csv",
        "2013-12 - Citi Bike trip data.csv",
        "2014-01 - Citi Bike trip data.csv",
        "2014-02 - Citi Bike trip data.csv"
    ]

    weatherDataFile = "Weather Data/NY305801_9255_edited.txt"

    citiBikeDataRaw = []

    for file in citiBikeDataFileNames:
        print(file)
        filepath = citiBikeDataDirectory + "/" + file
        with open(filepath) as f:
            lines = f.read().splitlines()
            lines.pop(0)  # get rid of the first line that contains the column names
            for line in lines:
                line = line.replace('"', '')
                line = line.split(",")
                sLatLong = (line[5], line[6])
                eLatLong = (line[9], line[10])

                distance = vincenty(sLatLong, eLatLong).miles
                line.extend([distance])

                ## Monday       = 0
                ## Tuesday      = 1
                ## Wednesday    = 2
                ## Thursday     = 3
                ## Friday       = 4
                ## Saturday     = 5
                ## Sunday       = 6
                if parse(line[1]).weekday() == 0:
                    DayOfWeek = "Monday"
                elif parse(line[1]).weekday() == 1:
                    DayOfWeek = "Tuesday"
                elif parse(line[1]).weekday() == 2:
                    DayOfWeek = "Wednesday"
                elif parse(line[1]).weekday() == 3:
                    DayOfWeek = "Thursday"
                elif parse(line[1]).weekday() == 4:
                    DayOfWeek = "Friday"
                elif parse(line[1]).weekday() == 5:
                    DayOfWeek = "Saturday"
                else:
                    DayOfWeek = "Sunday"
                line.extend([DayOfWeek])

                ##Morning       5AM-10AM
                ##Midday        10AM-2PM
                ##Afternoon     2PM-5PM
                ##Evening       5PM-10PM
                ##Night         10PM-5AM

                if parse(line[1]).hour >= 5 and parse(line[1]).hour < 10:
                    TimeOfDay = 'Morning'
                elif parse(line[1]).hour >= 10 and parse(line[1]).hour < 14:
                    TimeOfDay = 'Midday'
                elif parse(line[1]).hour >= 14 and parse(line[1]).hour < 17:
                    TimeOfDay = 'Afternoon'
                elif parse(line[1]).hour >= 17 and parse(line[1]).hour < 22:
                    TimeOfDay = 'Evening'
                else:
                    TimeOfDay = 'Night'
                line.extend([TimeOfDay])

                ## 1 = Yes
                ## 0 = No
                if parse(line[1]) in holidays.UnitedStates():
                    holidayFlag = "1"
                else:
                    holidayFlag = "0"
                line.extend([holidayFlag])

                citiBikeDataRaw.append(line)
            del lines

    with open(weatherDataFile) as f:
        weatherDataRaw = f.read().splitlines()
        weatherDataRaw.pop(0)  # again, get rid of the column names
        for c in range(len(weatherDataRaw)):
            weatherDataRaw[c] = weatherDataRaw[c].split(",")
            # Adjust days and months to have a leading zero so we can capture all the data
            if len(weatherDataRaw[c][2]) < 2:
                weatherDataRaw[c][2] = "0" + weatherDataRaw[c][2]
            if len(weatherDataRaw[c][0]) < 2:
                weatherDataRaw[c][0] = "0" + weatherDataRaw[c][0]

    citiBikeData = []

    while (citiBikeDataRaw):
        instance = citiBikeDataRaw.pop()
        date = instance[1].split(" ")[0].split("-")  # uses the trip's start date
        for record in weatherDataRaw:
            if (str(date[0]) == str(record[4]) and str(date[1]) == str(record[2]) and str(date[2]) == str(record[0])):
                instance.extend([record[5], record[6], record[7], record[8], record[9]])
                citiBikeData.append(instance)

    del citiBikeDataRaw
    del weatherDataRaw

    # Final Columns:
    #  0 tripduration
    #  1 starttime
    #  2 stoptime
    #  3 start station id
    #  4 start station name
    #  5 start station latitude
    #  6 start station longitude
    #  7 end station id
    #  8 end station name
    #  9 end station latitude
    # 10 end station longitude
    # 11 bikeid
    # 12 usertype
    # 13 birth year
    # 14 gender
    # 15 start/end station distance
    # 16 DayOfWeek
    # 17 TimeOfDay
    # 18 HolidayFlag
    # 19 PRCP
    # 20 SNOW
    # 21 TAVE
    # 22 TMAX
    # 23 TMIN

    maxLineCount = 250000
    lineCounter = 1
    fileCounter = 1
    outputDirectoryFilename = "Compiled Data/dataset"
    f = open(outputDirectoryFilename + str(fileCounter) + ".csv", "w")
    for line in citiBikeData:
        if lineCounter == maxLineCount:
            print(f)
            f.close()
            lineCounter = 1
            fileCounter = fileCounter + 1
            f = open(outputDirectoryFilename + str(fileCounter) + ".csv", "w")
        f.write(",".join(map(str, line)) + "\n")
        lineCounter = lineCounter + 1

    del citiBikeData
        
endtime = datetime.now()
print("RunTime: ")
print(endtime-starttime)
Loading the Compiled Data from CSV

Now that we have compiled data files from both CitiBike and the weather data, we want to load that data into a Pandas dataframe for analysis. We iterate over and load each file produced above, then assign each column its appropriate data type. Additionally, we compute the Age column after producing a default value for missing "Birth Year" values. This is discussed further in the Data Quality section.

In [2]:
%%time

# Create CSV Reader Function and assign column headers
def reader(f, columns):
    d = pd.read_csv(f)
    d.columns = columns
    return d

# Identify All CSV FileNames needing to be loaded
path = r'Compiled Data'
all_files = glob.glob(os.path.join(path, "*.csv"))

    # Define File Columns
columns = ["tripduration", "starttime", "stoptime", "start_station_id", "start_station_name", "start_station_latitude",
           "start_station_longitude", "end_station_id", "end_station_name", "end_station_latitude",
           "end_station_longitude", "bikeid", "usertype", "birth year", "gender", "LinearDistance", "DayOfWeek",
           "TimeOfDay", "HolidayFlag", "PRCP", "SNOW", "TAVE", "TMAX", "TMIN"]

    # Load Data
CitiBikeDataCompiled = pd.concat([reader(f, columns) for f in all_files])

    # Replace '\N' Birth Years with Zero Values
CitiBikeDataCompiled["birth year"] = CitiBikeDataCompiled["birth year"].replace(r'\N','0')

    # Convert Columns to Numerical Values
CitiBikeDataCompiled[['tripduration', 'birth year', 'LinearDistance','PRCP', 'SNOW', 'TAVE', 'TMAX', 'TMIN']]\
    = CitiBikeDataCompiled[['tripduration', 'birth year','LinearDistance', 'PRCP', 'SNOW', 'TAVE', 'TMAX',
                            'TMIN']].apply(pd.to_numeric)

    # Convert Columns to Date Values
CitiBikeDataCompiled[['starttime', 'stoptime']] \
    = CitiBikeDataCompiled[['starttime', 'stoptime']].apply(pd.to_datetime)

    # Compute Age: 0 Birth Year = 0 Age ELSE Compute Start Time Year Minus Birth Year
CitiBikeDataCompiled["Age"] = np.where(CitiBikeDataCompiled["birth year"]==0, 0,
                                       CitiBikeDataCompiled["starttime"].dt.year - CitiBikeDataCompiled["birth year"])

    # Convert Columns to Str Values
CitiBikeDataCompiled[['start_station_id', 'end_station_id', 'bikeid', 'HolidayFlag', 'gender']] \
    = CitiBikeDataCompiled[['start_station_id', 'end_station_id', 'bikeid', 'HolidayFlag','gender']].astype(str)
Wall time: 2min 51s
In [3]:
print(len(CitiBikeDataCompiled))
display(CitiBikeDataCompiled.head())
5562293
tripduration starttime stoptime start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bikeid usertype birth year gender LinearDistance DayOfWeek TimeOfDay HolidayFlag PRCP SNOW TAVE TMAX TMIN Age
0 308 2014-02-28 23:59:10 2014-03-01 00:04:18 353 S Portland Ave & Hanson Pl 40.685396 -73.974315 365 Fulton St & Grand Ave 40.682232 -73.961458 14761 Subscriber 1982 1 0.709731 Friday Night 0 0.0 0.0 17 24 9 32
1 304 2014-02-28 23:58:17 2014-03-01 00:03:21 497 E 17 St & Broadway 40.737050 -73.990093 334 W 20 St & 7 Ave 40.742388 -73.997262 17112 Subscriber 1968 1 0.526555 Friday Night 0 0.0 0.0 17 24 9 46
2 1355 2014-02-28 23:57:55 2014-03-01 00:20:30 470 W 20 St & 8 Ave 40.743453 -74.000040 302 Avenue D & E 3 St 40.720828 -73.977932 15608 Subscriber 1985 2 1.945255 Friday Night 0 0.0 0.0 17 24 9 29
3 848 2014-02-28 23:57:13 2014-03-01 00:11:21 498 Broadway & W 32 St 40.748549 -73.988084 432 E 7 St & Avenue A 40.726218 -73.983799 17413 Subscriber 1976 1 1.557209 Friday Night 0 0.0 0.0 17 24 9 38
4 175 2014-02-28 23:57:12 2014-03-01 00:00:07 383 Greenwich Ave & Charles St 40.735238 -74.000271 284 Greenwich Ave & 8 Ave 40.739017 -74.002638 15220 Subscriber 1956 1 0.288829 Friday Night 0 0.0 0.0 17 24 9 58

Data Quality

Measurable Data Quality Factors

When analyzing our final dataset for accurate measures, there are a few key factors we can easily verify/research:

  • Computational Accuracy: Ensure data attributes added by computation are correct

    • TimeOfDay
    • DayOfWeek
    • HolidayFlag
  • Missing Data from Source

  • Duplicate Data from Source
  • Outlier Detection
  • Sampling to 500,000 Records for further analysis
Immeasurable Data Quality Factors

Although we are able to research these many factors, one computed attribute may still lack information in this dataset. Our LinearDistance attribute computes the distance from one lat/long coordinate to another. This attribute does not, however, tell us the 'true' distance a biker traveled before returning the bike. Some bikers may be biking for exercise around the city with various turns and loops, whereas others travel the quickest path to their destination. Because our dataset limits us to start and end locations, we do not have enough information to accurately compute distance traveled. Because of this, we have named the attribute "LinearDistance" rather than "DistanceTraveled".

Below we will walk through the process of researching the 'Measurable' data quality factors mentioned above:

Computational Accuracy: TimeOfDay

To help mitigate challenges with time-series data, we have chosen to break TimeOfDay into five categories, which are broken down below:

  • Morning 5 AM - 10 AM
  • Midday 10 AM - 2 PM
  • Afternoon 2 PM - 5 PM
  • Evening 5 PM - 10 PM
  • Night 10 PM - 5 AM

To ensure that these breakdowns are accurately computed, we pulled the distinct list of TimeOfDay assignments by starttime hour. Looking at the results below, we can verify that this categorization is correctly being assigned.

In [4]:
    # Compute StartHour from StartTime
CitiBikeDataCompiled["StartHour"] = CitiBikeDataCompiled["starttime"].dt.hour

    # Compute Distinct Combinations of StartHour and TimeOfDay
DistinctTimeOfDayByHour = CitiBikeDataCompiled[["StartHour", "TimeOfDay"]].drop_duplicates().sort_values("StartHour")

    # Print
display(DistinctTimeOfDayByHour)

    #Clean up Variables
del CitiBikeDataCompiled["StartHour"]
StartHour TimeOfDay
9517 0 Night
9482 1 Night
9470 2 Night
9457 3 Night
9437 4 Night
9362 5 Morning
9147 6 Morning
8642 7 Morning
7644 8 Morning
6866 9 Morning
6452 10 Midday
6113 11 Midday
5696 12 Midday
5228 13 Midday
4734 14 Afternoon
4199 15 Afternoon
3460 16 Afternoon
2405 17 Evening
1464 18 Evening
851 19 Evening
503 20 Evening
298 21 Evening
128 22 Night
0 23 Night
Computational Accuracy: DayOfWeek

In order to verify our computed DayOfWeek column, we have chosen one full week from 12/22/2013 - 12/28/2013 to validate. Below is a calendar image of this week to baseline our expected results:

To verify these 7 days, we pulled the distinct list of DayOfWeek assignments by StartDate (No Time). If we can verify one full week, we may justify that the computation is correct across the entire dataset. Looking at the results below, we can verify that this categorization is correctly being assigned.

In [5]:
    # Create DataFrame for StartTime, DayOfWeek within Date Threshold
CitiBikeDayOfWeekTest = CitiBikeDataCompiled[(CitiBikeDataCompiled['starttime'].dt.year == 2013)
                                             & (CitiBikeDataCompiled['starttime'].dt.month == 12)
                                             & (CitiBikeDataCompiled['starttime'].dt.day >= 22)
                                             & (CitiBikeDataCompiled['starttime'].dt.day <= 28)][
    ["starttime", "DayOfWeek"]]

    # Create FloorDate Variable as StartTime without the timestamp
CitiBikeDayOfWeekTest["StartFloorDate"] = CitiBikeDayOfWeekTest["starttime"].dt.strftime('%m/%d/%Y')

    # Compute Distinct combinations
DistinctDayOfWeek = CitiBikeDayOfWeekTest[["StartFloorDate", "DayOfWeek"]].drop_duplicates().sort_values(
    "StartFloorDate")

    #Print
display(DistinctDayOfWeek)

    # Clean up Variables
del CitiBikeDayOfWeekTest
del DistinctDayOfWeek
StartFloorDate DayOfWeek
107323 12/22/2013 Sunday
100367 12/23/2013 Monday
89342 12/24/2013 Tuesday
86082 12/25/2013 Wednesday
76319 12/26/2013 Thursday
64599 12/27/2013 Friday
52577 12/28/2013 Saturday
Computational Accuracy: HolidayFlag

Using the same week as was used to verify DayOfWeek, we can test whether HolidayFlag is set correctly for the Christmas holiday. We pulled the distinct list of HolidayFlag assignments by StartDate (No Time). If we can verify one holiday, we may justify that the computation is correct across the entire dataset. Looking at the results below, we expect to see HolidayFlag = 1 only for 12/25/2013.

In [6]:
    # Create DataFrame for StartTime, HolidayFlag within Date Threshold
CitiBikeHolidayFlagTest = CitiBikeDataCompiled[(CitiBikeDataCompiled['starttime'].dt.year == 2013)
                                             & (CitiBikeDataCompiled['starttime'].dt.month == 12)
                                             & (CitiBikeDataCompiled['starttime'].dt.day >= 22)
                                             & (CitiBikeDataCompiled['starttime'].dt.day <= 28)][
    ["starttime", "HolidayFlag"]]

    # Create FloorDate Variable as StartTime without the timestamp
CitiBikeHolidayFlagTest["StartFloorDate"] = CitiBikeHolidayFlagTest["starttime"].dt.strftime('%m/%d/%Y')

    # Compute Distinct combinations
DistinctHolidayFlag = CitiBikeHolidayFlagTest[["StartFloorDate", "HolidayFlag"]].drop_duplicates().sort_values(
    "StartFloorDate")
    
    #Print
display(DistinctHolidayFlag)
    
    # Clean up Variables
del CitiBikeHolidayFlagTest
del DistinctHolidayFlag
StartFloorDate HolidayFlag
107323 12/22/2013 0
100367 12/23/2013 0
89342 12/24/2013 0
86082 12/25/2013 1
76319 12/26/2013 0
64599 12/27/2013 0
52577 12/28/2013 0
Missing Data from Source

Accounting for missing data is a crucial part of our analysis. At first glance, it is very apparent that we have a large amount of missing data in the Gender and Birth Year attributes from our source CitiBike data. We already handled missing Birth Year values while computing "Age" in our Data Load from CSV section of this paper, assigning a default value of 0 so that future computations do not produce NA values. Gender missing values are already assigned a default value of 0 in the source data. Although we have handled these missing values with defaults, we want to ensure that we 'need' these records for further analysis, or determine whether we may remove them from the dataset.

Below you will see a table showing the frequency of missing values (or forced default values) by usertype. We noticed that of the 4,881,384 subscribing members in our dataset, only 295 were missing Gender information, whereas out of the 680,909 Customer users (non-subscribing), there was only one observation with complete information for both Gender and Birth Year. This quickly told us that removing records with missing values is NOT an option, since we would lose the data for our entire Customer usertype. These attributes, as well as Age (computed from birth year), will prove difficult to use in a classification model attempting to predict usertype.

We have also looked at all other attributes and verified that there are no additional missing values in our dataset. A missing value matrix was produced to identify whether there were any gaps in our data across all attributes. Because the results were conclusive, with no missing values present, we removed this uninformative visualization from the report.

In [7]:
NADatatestData = CitiBikeDataCompiled[["usertype","gender", "birth year"]]

NADatatestData["GenderISNA"] = np.where(CitiBikeDataCompiled["gender"] == '0', 1, 0)
NADatatestData["BirthYearISNA"] = np.where(CitiBikeDataCompiled["birth year"] == 0, 1,0)

NAAggs = pd.DataFrame({'count' : NADatatestData.groupby(["usertype","GenderISNA", "BirthYearISNA"]).size()}).reset_index()

display(NAAggs)

del NAAggs
usertype GenderISNA BirthYearISNA count
0 Customer 0 0 1
1 Customer 0 1 42
2 Customer 1 0 73
3 Customer 1 1 680793
4 Subscriber 0 0 4881089
5 Subscriber 1 0 295
Duplicate Data from Source

To ensure that there are no duplicate records in our datasets, we verified that the numbers of records before and after removing potential duplicates were equal. This test passed, so no alterations to the dataset were needed based on duplicate records.

In [8]:
len(CitiBikeDataCompiled) == len(CitiBikeDataCompiled.drop_duplicates())
Out[8]:
True
Outlier Detection

Trip Duration

In analyzing a box plot of trip duration values, we find extreme outliers present. With durations reaching up to 72 days in the most extreme instance, our team decided to rule out any observation with a duration greater than a 24-hour period. Beyond 24 hours, it is far more likely that an individual slept overnight with the bike still checked out than that the ride itself lasted that long. Such values easily skew this attribute, potentially hurting any analysis done with it. We move forward with removing a total of 457 observations with trip durations greater than 24 hours (86,400 seconds).

In [9]:
%%time
%matplotlib inline

#CitiBikeDataCompiledBackup = CitiBikeDataCompiled
#CitiBikeDataCompiled = CitiBikeDataCompiledBackup

    # BoxPlot tripDuration - Heavy Outliers!
sns.boxplot(y = "tripduration", data = CitiBikeDataCompiled)
sns.despine()
    
    # How Many Greater than 24 hours?
print(len(CitiBikeDataCompiled[CitiBikeDataCompiled["tripduration"]>86400]))

    # Remove > 24 Hours
CitiBikeDataCompiled = CitiBikeDataCompiled[CitiBikeDataCompiled["tripduration"]<=86400]
457
Wall time: 4.99 s

Once outliers are removed, we run the boxplot again, still seeing skewness in the results. To mitigate this right-skewed distribution, we apply a log transform to this attribute.

In [10]:
%%time
%matplotlib inline

    # BoxPlot Trip Duration AFTER removal of outliers
sns.boxplot(y = "tripduration", data = CitiBikeDataCompiled)
sns.despine()

    # Log Transform Column Added
CitiBikeDataCompiled["tripdurationLog"] = CitiBikeDataCompiled["tripduration"].apply(np.log)
Wall time: 2.23 s
In [11]:
%%time
%matplotlib inline

    # BoxPlot TripDurationLog
sns.boxplot(y = "tripdurationLog", data = CitiBikeDataCompiled)
sns.despine()
Wall time: 2.14 s

Age

Similarly, we look at the distribution of Age in our dataset. Interestingly, several outlier observations log a birth year far enough back for the computed age to reach 115 years. Possible reasons for these outlier ages include data entry errors by riders who prefer not to disclose personal information, or account sharing between a parent and a child, rendering the data point inaccurate for the person actually taking the trip. Our target demographic for this study is individuals under 65 years of age, given that these are the age groups more likely to be in suitable physical condition for the bike share service. Given this target demographic, and the poor entries causing extreme outliers, we have chosen to limit our dataset to observations up to 65 years of age. This change removed an additional 53,824 records from the dataset.

In [12]:
%%time
%matplotlib inline

    # BoxPlot Age - Outliers!
sns.boxplot(y = "Age", data = CitiBikeDataCompiled[CitiBikeDataCompiled["Age"]!= 0])
sns.despine()
    
    # How Many Greater than 65 years old?
print(len(CitiBikeDataCompiled[CitiBikeDataCompiled["Age"]>65]))

    # Remove > 65 years old
CitiBikeDataCompiled = CitiBikeDataCompiled[CitiBikeDataCompiled["Age"]<=65]
53824
Wall time: 12.1 s
In [13]:
%%time
%matplotlib inline

    # BoxPlot Age - removed Outliers!
sns.boxplot(y = "Age", data = CitiBikeDataCompiled[CitiBikeDataCompiled["Age"]!= 0])
sns.despine()
Wall time: 7.45 s
Record Sampling to 500,000 Records

Given the extremely large volume of data collected, we decided to sample down to roughly 1/10th of the original dataset, for a total of 500,000 records. Before taking this action, however, we wanted to keep data proportions reasonable for analysis and ensure we do not lose any important demographic in our data.

Below we compute the percentage of our dataset made up of Customers vs. Subscribers. To make sure our sample is representative of the full dataset, we stratify the sample to match the original proportions.

In [14]:
%matplotlib inline
UserTypeDist = pd.DataFrame({'count' : CitiBikeDataCompiled.groupby(["usertype"]).size()}).reset_index()
display(UserTypeDist)

UserTypeDist.plot.pie(y = 'count', labels = ['Customer', 'Subscriber'], autopct='%1.1f%%')
usertype count
0 Customer 680796
1 Subscriber 4827216
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x27a80ed9208>

Given these distribution percentages, we compute the sample size for each usertype and take a random sample within each group. Below, the sampled distribution matches that of the original dataset between Customer and Subscriber usertypes.

In [15]:
SampleSize = 500000

CustomerSampleSize_Seed   = int(round(SampleSize * 12.4 / 100.0,0))
SubscriberSampleSize_Seed = int(round(SampleSize * 87.6 / 100.0,0))

CitiBikeCustomerDataSampled = CitiBikeDataCompiled[CitiBikeDataCompiled["usertype"] == 'Customer'].sample(n=CustomerSampleSize_Seed, replace = False, random_state = CustomerSampleSize_Seed)
CitiBikeSubscriberDataSampled = CitiBikeDataCompiled[CitiBikeDataCompiled["usertype"] == 'Subscriber'].sample(n=SubscriberSampleSize_Seed, replace = False, random_state = SubscriberSampleSize_Seed)

CitiBikeDataSampled = pd.concat([CitiBikeCustomerDataSampled,CitiBikeSubscriberDataSampled])

print(len(CitiBikeDataSampled))

UserTypeDist = pd.DataFrame({'count' : CitiBikeDataSampled.groupby(["usertype"]).size()}).reset_index()
display(UserTypeDist)

UserTypeDist.plot.pie(y = 'count', labels = ['Customer', 'Subscriber'], autopct='%1.1f%%')

del CitiBikeDataCompiled
500000
usertype count
0 Customer 62000
1 Subscriber 438000
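As an aside, the stratum sizes above were hardcoded from the observed 12.4% / 87.6% split; the same stratified sample can be drawn with the allocation derived directly from the data. A minimal sketch on a toy frame (the column name `usertype` matches our data; the values themselves are hypothetical):

```python
import pandas as pd

# Toy frame standing in for the full trip table (hypothetical values)
df = pd.DataFrame({
    "usertype": ["Customer"] * 124 + ["Subscriber"] * 876,
    "tripduration": range(1000),
})

SampleSize = 100

# Derive each stratum's allocation from the observed proportions
# instead of hardcoding 12.4% / 87.6%
counts = df["usertype"].value_counts()
sizes = (counts / counts.sum() * SampleSize).round().astype(int)

sampled = pd.concat([
    df[df["usertype"] == utype].sample(n=n, replace=False, random_state=42)
    for utype, n in sizes.items()
])

print(len(sampled), sizes.to_dict())
```

This keeps the sample proportions in step with the data even if the underlying class split changes after further cleaning.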

Visualize appropriate statistics

With the massive data set randomly sampled down to 500,000 entries, we can more easily begin to explore the data available to us. We began by running basic descriptive statistics to get a high-level, top-down view of the Citi Bike rentals we sampled.

Descriptive Statistics

In the first table, we look at the categorical/non-numerical data available to us. The sampled data contains the same number of unique stations (330) as the full data set, a good start toward a representative sample.

In [16]:
CitiBikeDataSampled.describe(include=['O']).transpose()
Out[16]:
count unique top freq
start_station_id 500000 330 519 5226
start_station_name 500000 330 Pershing Square N 5226
end_station_id 500000 330 497 5281
end_station_name 500000 330 E 17 St & Broadway 5281
bikeid 500000 6679 20382 139
usertype 500000 2 Subscriber 438000
gender 500000 3 1 335793
DayOfWeek 500000 7 Wednesday 77059
TimeOfDay 500000 5 Evening 169944
HolidayFlag 500000 2 0 487796

Next we reviewed our numerical data, pulling the mean, standard deviation, and quartiles for each feature. Trip duration, as a reminder, is measured in seconds, with a minimum trip duration of 1 minute. Shorter trips were eliminated by Citi Bike in the source data to remove likely errors, such as riders docking a bike immediately after removing it.

Among the numerical statistics, we are largely focused on Age, Trip Duration, and Linear Distance, as well as the weather data. Birth year was primarily used to calculate Age from the trip start date. The station latitudes and longitudes are of little use in this tabular form, but will prove valuable for constructing location heatmaps.
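As a side note, the Age derivation mentioned above can be sketched as follows. The exact rule used to build the column is our working assumption (trip start year minus birth year, with 0 preserved as the sentinel for missing birth years); the frame below is a hypothetical stand-in:

```python
import pandas as pd

# Hypothetical trips; birth year 0 marks a missing value, as in the source data
trips = pd.DataFrame({
    "starttime": pd.to_datetime(["2013-07-04 08:00", "2013-12-01 17:30", "2014-01-15 09:10"]),
    "birth year": [1980, 0, 1990],
})

# Age at trip start, keeping 0 as the missing-value sentinel
trips["Age"] = trips.apply(
    lambda r: r["starttime"].year - r["birth year"] if r["birth year"] > 0 else 0,
    axis=1,
)

print(trips["Age"].tolist())  # [33, 0, 24]
```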

In [17]:
CitiBikeDataSampled.describe().transpose()
Out[17]:
count mean std min 25% 50% 75% max
tripduration 500000.0 858.452812 1354.407817 60.000000 398.000000 634.000000 1041.000000 85939.000000
start_station_latitude 500000.0 40.734435 0.019850 40.680342 40.720664 40.736245 40.750200 40.770513
start_station_longitude 500000.0 -73.990926 0.012373 -74.017134 -74.000040 -73.990765 -73.981923 -73.950048
end_station_latitude 500000.0 40.734092 0.019868 40.680342 40.720434 40.735439 40.749718 40.770513
end_station_longitude 500000.0 -73.991057 0.012464 -74.017134 -74.000264 -73.990765 -73.981948 -73.950048
birth year 500000.0 1730.909574 651.272833 0.000000 1963.000000 1976.000000 1983.000000 1997.000000
LinearDistance 500000.0 1.121768 0.844987 0.000000 0.536074 0.884606 1.466230 6.491681
PRCP 500000.0 0.062117 0.201216 0.000000 0.000000 0.000000 0.010000 1.980000
SNOW 500000.0 0.038693 0.440194 0.000000 0.000000 0.000000 0.000000 11.000000
TAVE 500000.0 61.729642 16.755311 11.000000 50.000000 65.000000 75.000000 90.000000
TMAX 500000.0 68.556806 17.306384 17.000000 56.000000 73.000000 82.000000 98.000000
TMIN 500000.0 54.414774 16.471105 4.000000 43.000000 57.000000 68.000000 83.000000
Age 500000.0 32.589888 15.711623 0.000000 26.000000 33.000000 43.000000 65.000000
tripdurationLog 500000.0 6.463263 0.716593 4.094345 5.986452 6.452049 6.947937 11.361393
Correlated Numerical Variables

Before we begin visualizing some of the more specific variable interactions, it is important to see the magnitude of possible correlations, which we found using pandas' Pearson correlation function. We will ignore the obvious, such as birth year's correlation with age and tripdurationLog's correlation with tripduration, as well as most correlations involving latitude and longitude.

Right away we noticed the large correlations between start station coordinates and end station coordinates. At first we suspected this was because renters start and end at the same station, but we ruled that out by counting all records where start_station_id equals end_station_id and found they make up less than 2.5% of all records. More likely, we are seeing collinearity due to the close proximity of all the stations and the low variance among coordinates.

Beyond that, no pair of features displayed strong correlation, except perhaps tripduration and LinearDistance, which is to be expected. That said, Age did show weak correlation with trip duration as well as with the weather attributes. This may indicate that age plays a factor in whether someone rents a bike in certain weather, and in how far or how long they travel.

In [18]:
CitiBikeDataSampled.corr()
Out[18]:
tripduration start_station_latitude start_station_longitude end_station_latitude end_station_longitude birth year LinearDistance PRCP SNOW TAVE TMAX TMIN Age tripdurationLog
tripduration 1.000000 -0.005869 0.003117 -0.011039 0.009245 -0.178884 0.213801 -0.003804 -0.010880 0.069935 0.068630 0.070468 -0.127536 0.584256
start_station_latitude -0.005869 1.000000 0.205194 0.604507 0.083955 0.033908 -0.045915 0.000784 0.000034 -0.028017 -0.026959 -0.028730 0.061325 0.003075
start_station_longitude 0.003117 0.205194 1.000000 0.098436 0.427627 0.032677 0.012287 0.005539 0.007632 -0.029121 -0.029214 -0.028549 0.011036 0.006132
end_station_latitude -0.011039 0.604507 0.098436 1.000000 0.197283 0.031408 -0.066137 -0.000028 0.000785 -0.030836 -0.029865 -0.031381 0.060129 -0.010492
end_station_longitude 0.009245 0.083955 0.427627 0.197283 1.000000 0.026512 0.009566 0.003652 0.006380 -0.022950 -0.023123 -0.022394 0.004923 0.010440
birth year -0.178884 0.033908 0.032677 0.031408 0.026512 1.000000 -0.038977 0.018562 0.028474 -0.180298 -0.176466 -0.182285 0.770860 -0.283725
LinearDistance 0.213801 -0.045915 0.012287 -0.066137 0.009566 -0.038977 1.000000 -0.004260 -0.022267 0.107582 0.107179 0.106443 -0.033102 0.510064
PRCP -0.003804 0.000784 0.005539 -0.000028 0.003652 0.018562 -0.004260 1.000000 0.203194 0.046707 0.042352 0.049048 0.015838 -0.016911
SNOW -0.010880 0.000034 0.007632 0.000785 0.006380 0.028474 -0.022267 0.203194 1.000000 -0.178241 -0.176072 -0.177325 0.029780 -0.023780
TAVE 0.069935 -0.028017 -0.029121 -0.030836 -0.022950 -0.180298 0.107582 0.046707 -0.178241 1.000000 0.992815 0.991744 -0.161081 0.149086
TMAX 0.068630 -0.026959 -0.029214 -0.029865 -0.023123 -0.176466 0.107179 0.042352 -0.176072 0.992815 1.000000 0.969710 -0.157519 0.148058
TMIN 0.070468 -0.028730 -0.028549 -0.031381 -0.022394 -0.182285 0.106443 0.049048 -0.177325 0.991744 0.969710 1.000000 -0.162760 0.148134
Age -0.127536 0.061325 0.011036 0.060129 0.004923 0.770860 -0.033102 0.015838 0.029780 -0.161081 -0.157519 -0.162760 1.000000 -0.194335
tripdurationLog 0.584256 0.003075 0.006132 -0.010492 0.010440 -0.283725 0.510064 -0.016911 -0.023780 0.149086 0.148058 0.148134 -0.194335 1.000000
In [19]:
CitiBikeDataSampled.query('start_station_id == end_station_id')["start_station_id"].count() / CitiBikeDataSampled["start_station_id"].count()
Out[19]:
0.024365999999999999
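As an aside, rather than scanning the matrix by eye, the strongest off-diagonal pairs can be surfaced programmatically by masking the diagonal and sorting by absolute value. A sketch on synthetic data (column names here are illustrative only):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "a": x,
    "b": x + rng.normal(scale=0.1, size=200),  # strongly correlated with a
    "c": rng.normal(size=200),                 # independent noise
})

corr = df.corr()

# Mask the diagonal, flatten, and rank variable pairs by |r|
pairs = corr.where(~np.eye(len(corr), dtype=bool)).abs().unstack().dropna()
print(pairs.sort_values(ascending=False).head(2))
```

Each pair appears twice (once in each orientation), so in practice one might also deduplicate on a sorted index.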
Covariance Among Numerical Attributes

We also examined covariance among the numerical attributes, finding relatively high covariance between tripduration and the weather attributes as well as age, further supporting the idea that these factors play into whether a person decides to rent and, if they do, how long they ride. Eventually we will explore at what point these decisions to ride culminate in the transition from customer to subscriber, but for now, let's examine age.
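It is worth remembering that covariance is scale-dependent, so the large values involving tripduration partly reflect its large variance (it is measured in seconds) rather than strength of association. Dividing each covariance by both standard deviations recovers the correlation matrix; a sketch on synthetic data (values are hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "tripduration": rng.normal(850, 1350, size=500),  # seconds: large scale
    "TAVE": rng.normal(62, 17, size=500),             # degrees F: smaller scale
})

cov = df.cov()
std = np.sqrt(np.diag(cov))

# Dividing each covariance by both standard deviations yields the correlation
corr_from_cov = cov / np.outer(std, std)
print(np.allclose(corr_from_cov, df.corr()))  # True
```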

In [20]:
CitiBikeDataSampled.cov()
Out[20]:
tripduration start_station_latitude start_station_longitude end_station_latitude end_station_longitude birth year LinearDistance PRCP SNOW TAVE TMAX TMIN Age tripdurationLog
tripduration 1.834421e+06 -1.577880e-01 0.052229 -2.970591e-01 0.156072 -157791.968952 244.686291 -1.036751e+00 -6.486603e+00 1587.063089 1608.671743 1572.031193 -2713.952212 567.054707
start_station_latitude -1.577880e-01 3.940325e-04 0.000050 2.384088e-04 0.000021 0.438364 -0.000770 3.133001e-06 2.996972e-07 -0.009318 -0.009261 -0.009394 0.019126 0.000044
start_station_longitude 5.222931e-02 5.039805e-05 0.000153 2.419854e-05 0.000066 0.263323 0.000128 1.379070e-05 4.156898e-05 -0.006037 -0.006256 -0.005818 0.002145 0.000054
end_station_latitude -2.970591e-01 2.384088e-04 0.000024 3.947386e-04 0.000049 0.406403 -0.001110 -1.111496e-07 6.861172e-06 -0.010265 -0.010269 -0.010269 0.018770 -0.000149
end_station_longitude 1.560723e-01 2.077191e-05 0.000066 4.885486e-05 0.000155 0.215211 0.000101 9.158465e-06 3.500221e-05 -0.004793 -0.004988 -0.004597 0.000964 0.000093
birth year -1.577920e+05 4.383635e-01 0.263323 4.064031e-01 0.215211 424156.302562 -21.449952 2.432449e+00 8.163142e+00 -1967.458366 -1988.981100 -1955.403446 7887.864935 -132.413969
LinearDistance 2.446863e+02 -7.701396e-04 0.000128 -1.110325e-03 0.000101 -21.449952 0.714003 -7.243800e-04 -8.282280e-03 1.523149 1.567346 1.481464 -0.439462 0.308850
PRCP -1.036751e+00 3.133001e-06 0.000014 -1.111496e-07 0.000009 2.432449 -0.000724 4.048796e-02 1.799774e-02 0.157469 0.147482 0.162556 0.050071 -0.002438
SNOW -6.486603e+00 2.996972e-07 0.000042 6.861172e-06 0.000035 8.163142 -0.008282 1.799774e-02 1.937710e-01 -1.314636 -1.341350 -1.285694 0.205964 -0.007501
TAVE 1.587063e+03 -9.318457e-03 -0.006037 -1.026526e-02 -0.004793 -1967.458366 1.523149 1.574688e-01 -1.314636e+00 280.740462 287.890423 273.700137 -42.405000 1.790032
TMAX 1.608672e+03 -9.261326e-03 -0.006256 -1.026892e-02 -0.004988 -1988.981100 1.567346 1.474818e-01 -1.341350e+00 287.890423 299.510912 276.421012 -42.831195 1.836166
TMIN 1.572031e+03 -9.393507e-03 -0.005818 -1.026942e-02 -0.004597 -1955.403446 1.481464 1.625561e-01 -1.285694e+00 273.700137 276.421012 271.297303 -42.120212 1.748437
Age -2.713952e+03 1.912604e-02 0.002145 1.876972e-02 0.000964 7887.864935 -0.439462 5.007057e-02 2.059642e-01 -42.405000 -42.831195 -42.120212 246.855106 -2.187983
tripdurationLog 5.670547e+02 4.374334e-05 0.000054 -1.493731e-04 0.000093 -132.413969 0.308850 -2.438384e-03 -7.501117e-03 1.790032 1.836166 1.748437 -2.187983 0.513506
Age Distribution

With age, we found a right-skewed distribution (common with population age variables), with the majority of our renters falling below 35. Notably, nearly all the age data was provided by subscribers; most customers (non-subscribers) either did not input an age or input one that fell outside our expectations, as described in data quality. All further analysis of age and its relationships with other attributes is therefore done with the understanding that age describes subscribers, not necessarily the entire population of Citi Bike riders/renters.

In [21]:
sns.distplot(CitiBikeDataSampled.query('Age != 0')["Age"])
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x27addc15f28>
Linear Distance vs Trip Duration

As part of our initial exploration and attribute engineering, we wanted to see how far people were traveling during their rental period. Short of making navigation calls to Google Maps or some other navigation service, we decided to calculate the linear distance between the start and end stations to approximate the distance traveled. We acknowledge, however, that riders were not necessarily traveling from one station to another directly: quite a few rode for several hours and returned their bikes to stations only blocks apart. Likewise, a number of riders apparently traveled zero miles despite having trip durations in the minutes.
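The linear distance described above amounts to a great-circle calculation between station coordinates; whether the upstream pipeline used exactly the haversine formula is our assumption. A minimal sketch (the sample coordinates only approximate the Pershing Square N and E 17 St & Broadway stations):

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in miles."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Roughly Pershing Square N to E 17 St & Broadway (approximate coordinates)
print(round(haversine_miles(40.7519, -73.9777, 40.7370, -73.9901), 2))
```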

Below is a joint grid with distribution graphs on each axis for tripdurationLog and LinearDistance. While it is not a perfect relationship, both attributes are skewed toward higher values and show a positive correlation. Further analysis will be necessary to draw any statistically significant conclusions, but this data, combined with a study on average biking speed, could help determine whether riders are simply using the bikes as transportation from one station to another, or as a means to travel outside the range of those stations.

In [22]:
dvd = sns.JointGrid(x="tripdurationLog", y="LinearDistance", data=CitiBikeDataSampled.query("LinearDistance > 0"))
dvd = dvd.plot(sns.regplot, sns.distplot)
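The biking-speed idea mentioned above can be approximated directly from these two columns: dividing linear miles by hours gives a lower bound on average riding speed, since the true route is at least as long as the straight line. A sketch on hypothetical trips:

```python
import pandas as pd

# Hypothetical trips: straight-line miles and trip seconds
trips = pd.DataFrame({"LinearDistance": [1.2, 0.5, 3.0],
                      "tripduration": [720, 600, 5400]})

# Straight-line mph is a lower bound on true riding speed, since the
# actual route is at least as long as the linear distance
trips["mph_lower_bound"] = trips["LinearDistance"] / (trips["tripduration"] / 3600.0)
print(trips["mph_lower_bound"].round(1).tolist())  # [6.0, 3.0, 2.0]
```

Very low lower bounds (like the third trip) would flag rides that were likely recreational loops rather than point-to-point travel.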

Visualize the most interesting attributes

To reiterate, our main objectives in analyzing these data are to determine which attributes have the greatest bearing on predicting a rider's type (Customer vs. Subscriber) and to gain a better understanding of rider behavior as a function of external factors. Many attributes in this data set will be used in subsequent labs to answer these questions. The primary attributes we focus on in this section, however, are as follows:

  • Starting Location
  • Day of the Week
  • Time of Day
  • Trip Duration (both log and non-log)
  • Linear Distance
  • Gender
  • Age

Over the course of this section, we review these top attributes in some detail and discuss the value of our chosen visualizations. The merged weather data is of significant interest as well; since we intend to compare weather conditions against various rider habits in depth, however, we will refrain from focusing on weather-related attributes until subsequent sections.

Geophysical Start Stations HeatMap

Before discussing the following heatmap in detail, it is worth noting some special steps required to use the gmaps module in Python, in case the reader is interested in rendering our code to plot data on top of Google's maps (full instructions are available at https://media.readthedocs.org/pdf/jupyter-gmaps/latest/jupyter-gmaps.pdf).

Besides having Jupyter Notebook installed on one's computer with extensions enabled (default if using Anaconda) and installing the gmaps module using pip, the following line should be run from within the command terminal. This is only to be done once and should be done when Jupyter Notebook is not running.

$ jupyter nbextension enable --py gmaps

In addition to running the above line in the command prompt, a Standard Google API key will need to be obtained from https://developers.google.com/maps/documentation/javascript/get-api-key. This only needs to be done once and is necessary to pull the Google map data into the Jupyter Notebook environment. The key is entered in the gmaps.configure() line as shown in the cell below. We have provided our own private key for the reader's convenience.

Now on to the data visualization... This geographical heatmap is interactive; however, the kernel must re-run the code block each time our Jupyter Notebook file is opened due to the API key requirement. Therefore, we have captured some interesting views to aid our discussion and included them as embedded images.

The start station heatmap represents the start station location data via attributes start_station_latitude and start_station_longitude. It identifies areas of highest and lowest concentration for trip starts. The location data is important as it helps us understand where the areas of highest activity are and, as will be seen in one of our later sections, will play an important role in identifying riders as regular customers or subscribers.

In [23]:
%%time

gmaps.configure(api_key="AIzaSyAsBi0MhgoQWfoGMSl5UcD-vR6H76cntxg") # Load private Google API key

locations = CitiBikeDataSampled[['start_station_latitude', 'start_station_longitude']].values.tolist()

m = gmaps.Map()
heatmap_layer = gmaps.Heatmap(data = locations)
m.add_layer(heatmap_layer)
Wall time: 15 s

An overall view quickly reveals that station data was only provided for the southern portion of Manhattan and the northern portion of Brooklyn. This could mean either that the bike share program had not yet expanded into other areas at the time of data collection, or that the data simply wasn't included (as mentioned previously, many test sites were in use during this time frame, but Citi Bike did not include them in this data set).

Within the range of trip start frequency from the least number of trips (green) to the most trips (red), green and yellow indicate low to medium trip activity in most areas. However, higher pockets of concentration do exist in some places. We will attempt to put this visualization to good use by focusing in on one of these hotspots.

In [24]:
m

A prominent hotspot occurs just east of Bryant Park and the Midtown map label. Zooming into this area (via regular Google Maps controls, as the rendered visual is interactive) allows a closer look. A snapshot of this zoomed-in view is embedded below. The hotspot appears slightly elongated and stands out from the other stations. Zooming in further helps explain why, and may shed light on the higher activity in this area.

Zooming in further helps us see that two stations sit very close together. Even so, why might there be such high rider activity at these stations? The higher activity is likely driven by the stations' proximity to the famous Grand Central Station. As commuters and recreational riders alike arrive by train at Grand Central, it is natural that many continue their journey via the two closest bike share stations. When the northernmost station runs out of bikes, riders likely move to the next station to begin their ride instead.

By understanding the dynamics of geographical activity within this data set and the amenities that surround each station, we will be able to more efficiently leverage the data to make our classification and regression predictions.

Box Plot for Log Trip Duration by Day of the Week

Another attribute of interest is that of trip duration and days of the week. It can be expected that activity should vary depending on what day of the week riders are traveling. Not only are trip durations expected to vary, but with further analysis we expect the days of travel to have some influence on whether riders are bike share subscribers or not. To obtain a quick understanding of day-to-day variance in trip duration, a box plot is used.

The following interactive box plots are set up with the log of trip duration on the y-axis and the categorical days of the week along the x-axis. Once again, the log-transformed trip duration helps normalize the distribution for easier analysis. The plots reveal that trip durations do not vary much throughout the week, though there is some increase on the weekends. Zooming in on the IQR regions helps put this increase into perspective.

Though further analysis would be required, the data suggests that riders are spending more time riding on Saturdays and Sundays. Of greater interest will be Customer vs. Subscriber activity across each day of the week. This will be discussed further later.

In [25]:
td = CitiBikeDataSampled.tripdurationLog

days = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']

# One box per day, in calendar order
data = [go.Box(y=td.loc[CitiBikeDataSampled["DayOfWeek"] == day], name=day) for day in days]

layout = go.Layout(title='Log Trip Duration by Day of Week', xaxis=dict(title='DayOfWeek'), yaxis=dict(title='tripdurationLog (log sec)'))

fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig)
Box Plot for Linear Trip Distance by Day of the Week

In follow-up to the previous log trip duration box plots, it is also worth reviewing the linear distance traveled by riders each day of the week. Linear distance also appears to remain constant throughout the week for all riders; unlike trip duration, there is little change even on the weekends. While distance remains constant throughout the week, grouping distance by other categories may reveal more about rider activity in the data, as will be the case with user type shortly.

In [26]:
ld = CitiBikeDataSampled.LinearDistance

days = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']

# One box per day, in calendar order
data = [go.Box(y=ld.loc[CitiBikeDataSampled["DayOfWeek"] == day], name=day) for day in days]

layout = go.Layout(title='Linear Distance by Day of Week', xaxis=dict(title='DayOfWeek'), yaxis=dict(title='LinearDistance (miles)'))

fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig)
Violin Plot for Log Trip Duration by Day of the Week and Gender

As discussed previously, we do not have complete coverage of gender values in our dataset. Nevertheless, we are interested in whether there are significant differences between male and female bikers. Since the majority of customers (i.e., non-subscribers) do not provide Citi Bike with gender details, these findings are indicative mainly of subscribing members. We first built a violin plot with day of week on the x-axis and log trip duration on the y-axis, with two separate violins per day for male vs. female results. As discussed earlier, trip duration is consistent from day to day for both male and female bikers. Interestingly, females in general have ridden for longer durations than males. One could speculate this is due to additional excursions (e.g., shopping, events, etc.) or slower biking speeds during the trip.

In [27]:
sns.set(style="whitegrid", palette="pastel", color_codes=True)

# Load our subset data set
sub = CitiBikeDataSampled.query('gender != "0"')

# Draw a nested violinplot and split the violins for easier comparison
sns.violinplot(x="DayOfWeek", y="tripdurationLog", hue="gender", data=sub, split=False,
               inner="quart", palette={"1": "b", "2": "pink"}, linewidth=0.5,
              order=["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"])
sns.despine(left=True)
Violin Plot for Linear Distance by Day of the Week and Gender

To complement the findings above (Log Trip Duration by Day of Week and Gender) and gain additional insight into why females log longer trip durations, we built a second violin plot, this time with day of week on the x-axis and LinearDistance on the y-axis. It yielded much less insight, as linear distance is not indicative of the actual trip distance. The plotted linear distances appear fairly consistent both from day to day (as discussed previously) and between males and females. Further research and/or more informative data points, such as trip distance or average trip speed, could yield further insights. These additional data points will be discussed later in the paper.

In [28]:
sns.violinplot(x="DayOfWeek", y="LinearDistance", hue="gender", data=sub, split=False,
               inner="quart", palette={"1": "b", "2": "pink"}, linewidth=0.5,
              order=["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"])
sns.despine(left=True)
Stacked Bar Plot Trip Duration by Day of Week and Time of Day

Another attribute that may reveal much about Customer vs. Subscriber activity is time of day. The time of day attribute is categorical, assigning each ride to a group based on the hour range in which it begins. While trip durations are expected to change by time of day, lumping times of day together regardless of day of the week would likely be misleading, since rider activity changes with work schedules and weekend plans.
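The hour boundaries behind the five TimeOfDay groups are not spelled out in this report, so the cutoffs below are purely illustrative; the binning itself can be sketched with pandas' cut:

```python
import pandas as pd

hours = pd.Series([2, 7, 11, 14, 18, 23])  # hour each trip started

# Hypothetical cutoffs -- the actual boundaries behind TimeOfDay are not
# documented here
bins = [0, 6, 10, 14, 18, 22, 24]
labels = ["Night", "Morning", "Midday", "Afternoon", "Evening", "Night"]
tod = pd.cut(hours, bins=bins, labels=labels, right=False, ordered=False)
print(tod.tolist())
```

Note that `ordered=False` is required because the "Night" label covers both ends of the day.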

The interactive stacked bar plot below is better suited to raw trip duration data than to log-transformed data, as the raw data accentuates the true differences from day to day and across time slots. Because the raw data is strongly right-skewed, the median value from each day-time grouping was obtained and rendered in the plot. Stacking the median trip durations for each time slot within each day makes it easier to see the time slot proportionality differences across the week, and the stacked total of the medians makes it easier to compare overall activity for each day.

Correlating with the "Box Plot for Log Trip Duration by Day of the Week" visualization above, trip duration activity does increase on the weekends. Hovering over each day's bar reveals that Midday, Afternoon, and Evening activity is noticeably higher on Saturdays and Sundays than on weekdays, whereas Morning and Night activity is relatively consistent. Understanding these trends may help us understand rider intent throughout the week and how it affects the decision to subscribe to Citi Bike. Being cognizant of these trends, together with start and end location frequency, may also improve inventory at some stations at peak times of day.
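For reference, the per-slot, per-day medians stacked in this plot can also be computed in a single pass with a pivot table rather than slot-by-slot extraction; a sketch on a hypothetical miniature of the sampled frame:

```python
import pandas as pd

# Hypothetical miniature of the sampled frame
df = pd.DataFrame({
    "DayOfWeek": ["Sunday", "Sunday", "Monday", "Monday", "Monday"],
    "TimeOfDay": ["Morning", "Morning", "Morning", "Evening", "Evening"],
    "tripduration": [600, 800, 400, 900, 1100],
})

# Median trip duration for every (TimeOfDay, DayOfWeek) cell in one pass
medians = df.pivot_table(index="TimeOfDay", columns="DayOfWeek",
                         values="tripduration", aggfunc="median")
print(medians)
```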

In [29]:
td = CitiBikeDataSampled.tripduration

# Extract morning data
sunMorning = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Sunday') & (CitiBikeDataSampled["TimeOfDay"] == 'Morning')]
monMorning = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Monday') & (CitiBikeDataSampled["TimeOfDay"] == 'Morning')]
tueMorning = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Tuesday') & (CitiBikeDataSampled["TimeOfDay"] == 'Morning')]
wedMorning = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Wednesday') & (CitiBikeDataSampled["TimeOfDay"] == 'Morning')]
thuMorning = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Thursday') & (CitiBikeDataSampled["TimeOfDay"] == 'Morning')]
friMorning = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Friday') & (CitiBikeDataSampled["TimeOfDay"] == 'Morning')]
satMorning = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Saturday') & (CitiBikeDataSampled["TimeOfDay"] == 'Morning')]

# Extract midday data
sunMidday = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Sunday') & (CitiBikeDataSampled["TimeOfDay"] == 'Midday')]
monMidday = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Monday') & (CitiBikeDataSampled["TimeOfDay"] == 'Midday')]
tueMidday = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Tuesday') & (CitiBikeDataSampled["TimeOfDay"] == 'Midday')]
wedMidday = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Wednesday') & (CitiBikeDataSampled["TimeOfDay"] == 'Midday')]
thuMidday = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Thursday') & (CitiBikeDataSampled["TimeOfDay"] == 'Midday')]
friMidday = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Friday') & (CitiBikeDataSampled["TimeOfDay"] == 'Midday')]
satMidday = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Saturday') & (CitiBikeDataSampled["TimeOfDay"] == 'Midday')]

# Extract afternoon data
sunAfternoon = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Sunday') & (CitiBikeDataSampled["TimeOfDay"] == 'Afternoon')]
monAfternoon = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Monday') & (CitiBikeDataSampled["TimeOfDay"] == 'Afternoon')]
tueAfternoon = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Tuesday') & (CitiBikeDataSampled["TimeOfDay"] == 'Afternoon')]
wedAfternoon = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Wednesday') & (CitiBikeDataSampled["TimeOfDay"] == 'Afternoon')]
thuAfternoon = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Thursday') & (CitiBikeDataSampled["TimeOfDay"] == 'Afternoon')]
friAfternoon = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Friday') & (CitiBikeDataSampled["TimeOfDay"] == 'Afternoon')]
satAfternoon = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Saturday') & (CitiBikeDataSampled["TimeOfDay"] == 'Afternoon')]

# Extract evening data
sunEvening = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Sunday') & (CitiBikeDataSampled["TimeOfDay"] == 'Evening')]
monEvening = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Monday') & (CitiBikeDataSampled["TimeOfDay"] == 'Evening')]
tueEvening = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Tuesday') & (CitiBikeDataSampled["TimeOfDay"] == 'Evening')]
wedEvening = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Wednesday') & (CitiBikeDataSampled["TimeOfDay"] == 'Evening')]
thuEvening = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Thursday') & (CitiBikeDataSampled["TimeOfDay"] == 'Evening')]
friEvening = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Friday') & (CitiBikeDataSampled["TimeOfDay"] == 'Evening')]
satEvening = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Saturday') & (CitiBikeDataSampled["TimeOfDay"] == 'Evening')]

# Extract night data
sunNight = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Sunday') & (CitiBikeDataSampled["TimeOfDay"] == 'Night')]
monNight = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Monday') & (CitiBikeDataSampled["TimeOfDay"] == 'Night')]
tueNight = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Tuesday') & (CitiBikeDataSampled["TimeOfDay"] == 'Night')]
wedNight = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Wednesday') & (CitiBikeDataSampled["TimeOfDay"] == 'Night')]
thuNight = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Thursday') & (CitiBikeDataSampled["TimeOfDay"] == 'Night')]
friNight = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Friday') & (CitiBikeDataSampled["TimeOfDay"] == 'Night')]
satNight = td.loc[(CitiBikeDataSampled["DayOfWeek"] == 'Saturday') & (CitiBikeDataSampled["TimeOfDay"] == 'Night')]

# Compute median trip duration for each day of the week (reused for all times of day; medians, despite the "Avgs" name)
def compAvgs(sun, mon, tue, wed, thu, fri, sat):
    avgs = []
    avgs.append(statistics.median(sun))
    avgs.append(statistics.median(mon))
    avgs.append(statistics.median(tue))
    avgs.append(statistics.median(wed))
    avgs.append(statistics.median(thu))
    avgs.append(statistics.median(fri))
    avgs.append(statistics.median(sat))
    return avgs

morningAvg = compAvgs(sunMorning, monMorning, tueMorning, wedMorning, thuMorning, friMorning, satMorning)
middayAvg = compAvgs(sunMidday, monMidday, tueMidday, wedMidday, thuMidday, friMidday, satMidday)
afternoonAvg = compAvgs(sunAfternoon, monAfternoon, tueAfternoon, wedAfternoon, thuAfternoon, friAfternoon, satAfternoon)
eveningAvg = compAvgs(sunEvening, monEvening, tueEvening, wedEvening, thuEvening, friEvening, satEvening)
nightAvg = compAvgs(sunNight, monNight, tueNight, wedNight, thuNight, friNight, satNight)

# Define bar plot features
x=['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
morning = go.Bar(x=x, y=morningAvg, name='Morning')
midday = go.Bar(x=x, y=middayAvg, name='Midday')
afternoon = go.Bar(x=x, y=afternoonAvg, name='Afternoon')
evening = go.Bar(x=x, y=eveningAvg, name='Evening')
night = go.Bar(x=x, y=nightAvg, name='Night')

# Combine features and render bar plot
data = [morning, midday, afternoon, evening, night]
layout = go.Layout(barmode='stack', title='Trip Duration by Day of Week and Time of Day',
                   xaxis=dict(title='DayOfWeek'), yaxis=dict(title='tripdurationLog (sec)'))

fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig)
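The 35 per-day/per-period slice assignments above could be condensed into a single pandas `groupby`. A minimal sketch of the same median-by-group computation, using a hypothetical miniature frame with the same column names as `CitiBikeDataSampled`:

```python
import pandas as pd

# Hypothetical stand-in for CitiBikeDataSampled; column names match those used above.
df = pd.DataFrame({
    "DayOfWeek": ["Sunday", "Sunday", "Monday", "Monday"],
    "TimeOfDay": ["Morning", "Morning", "Night", "Night"],
    "tripdurationLog": [6.0, 8.0, 5.0, 7.0],
})

# One groupby replaces the per-day/per-period slices: rows are days,
# columns are times of day, values are median log trip durations.
medians = df.groupby(["DayOfWeek", "TimeOfDay"])["tripdurationLog"].median().unstack()
print(medians.loc["Sunday", "Morning"])  # 7.0
```

Each row of `medians` could then feed a `go.Bar` trace directly, avoiding the long chain of named variables.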
Contour Plot for Log Trip Duration with Respect to Age

Our team was interested in the correlation between the log of trip duration and age. To explore it, we produced a joint density plot of the log trip duration against Age. Once again, because "birth year" data is missing for our customer (non-subscribing) users, these insights are slightly skewed when interpreting the Pearson's r value associated with the plot. Recomputing on only those observations with a nonzero age yields a Pearson's r of 0.048. This is a very low positive correlation, confirming what is seen visually in the joint plot: trip duration changes very little as rider age increases. What was interesting to see was the difference between the Age = 0 density rings and the rest of the data set. As discussed previously, Age = 0 records mainly consist of customer (non-subscribing) users, so we can see that trip durations for customers are generally greater than those of subscribing users. This is discussed in more detail later in the paper, but one could infer that it is due to subscribers making routine trips versus customers riding for events, trails, and the like. Finally, the core density ring areas show that the majority of trips by subscribing members are taken by individuals between roughly 25 and 35 years old, lasting around 518 seconds (e^6.25). This matched our suspected core member age demographic, as these are likely working individuals in good physical condition for consistent riding.

In [30]:
cont = sns.jointplot(x=CitiBikeDataSampled.Age, y=CitiBikeDataSampled.tripdurationLog, kind='kde', color='r')

pearsondata = CitiBikeDataSampled[CitiBikeDataSampled["Age"]!= 0][["Age", "tripdurationLog"]]
print(pearsonr(pearsondata.Age,pearsondata.tripdurationLog))

del pearsondata
(0.047856050742256608, 2.1248625184768761e-220)

Visualize Relationships Between Attributes

HeatMap of TripDuration by Day of Week and Time of Day

Although we previously analyzed trip durations by day of week and by time of day separately, we wanted to further explore the median trip duration within specific day and time-of-day groupings. To do this, we produced a heatmap of median trip duration (raw values, computed within each Time of Day and Day of Week grouping), with Time of Day (ordered Morning through Night) on the X axis and Day of Week (ordered Sunday through Saturday) on the Y axis. Right off the bat, we see that the pairing with the largest median trip duration is Saturday afternoons (2-5PM). We can also see that weekend trips are generally longer than weekday trips, especially for Midday through Evening start times. Interestingly, trip durations are consistently higher in the evenings, probably due to travelers using the service after work hours. Most of these results were what the team expected and had hoped to see. In general, using median times within [Day of Week, Time of Day] groupings, we observed that evenings and weekends saw the longest trip durations, whereas weekday mornings saw the shortest. This information could be useful to Citi Bike in deciding which times of day and days of week to target with promotions, events, etc. to increase traffic flow. Knowing that trips are generally longer during the weekend could also explain bike availability concerns. Bike availability is a huge part of this bike share service and strongly affects travelers' satisfaction when bikes aren't available at their closest station. Combining this information with the geocoordinate density maps discussed in this paper could support bike station shift services, moving bikes from less traveled areas to more heavily traveled ones, and could potentially mitigate some availability loss in the service.

In [31]:
sns.set()

grouped = CitiBikeDataSampled.groupby(['DayOfWeek', 'TimeOfDay'], as_index=False)
groupAgg = grouped.aggregate(np.median)
groupAgg['DayOfWeek'] = pd.Categorical(groupAgg['DayOfWeek'], ['Sunday',
                                                               'Monday',
                                                               'Tuesday',
                                                               'Wednesday',
                                                               'Thursday',
                                                               'Friday',
                                                               'Saturday'])

groupAgg['TimeOfDay'] = pd.Categorical(groupAgg['TimeOfDay'], ['Morning',
                                                               'Midday',
                                                               'Afternoon',
                                                               'Evening',
                                                               'Night'])

groupAgg = groupAgg.sort_values(by=['DayOfWeek','TimeOfDay'])

dist0 = groupAgg[["DayOfWeek", "TimeOfDay", "tripduration"]]
dist1 = dist0.pivot("DayOfWeek", "TimeOfDay", "tripduration")

# Render DayOfWeek vs. TimeOfDay heatmap of median trip duration
sns.heatmap(dist1, annot=True, fmt="f", linewidths=0.01)
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x27add98f780>
Log Trip Duration with respect to Average Temperature Joint Density Plot

We were interested in the effect temperature has on trip duration. To explore it, we created a joint density plot of Average Temperature (TAVE) against the log of trip duration (tripdurationLog). Interestingly, we did not see as high a correlation as expected: with a Pearson's r of 0.15, there is only a small positive association between trip duration (as depicted by the log transformation) and average temperature. The team expected much more correlation, as we thought individuals would not enjoy being on a bike during cold weather. We do, however, see that the density of trip durations is highest around 75 degrees Fahrenheit. This matched our expectations because, although durations remained fairly unchanged, the number of trips taken decreased as average temperature decreased. This is also depicted by the skewed distribution shown on the Y axis (right side of plot). One possible reason we did not see a larger change in trip duration is the nature of subscriber usage of the bike share service. Subscribers may use the service for routine travel around the city: grocery trips, work trips, trips to meet friends, and so on. If the bike share service is a core means of transportation, the distances of these trips do not change with cold weather, so the number of riders decreases while durations stay mostly consistent. Further research on this correlation for subscribers versus customers, and potentially surveys of subscribing members, could help test this theory. If these insights hold during cold weather months, they could inform marketing promotions aimed at working individuals to increase bike share traffic for routine trips.
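The Pearson's r reported above could be computed with `scipy.stats.pearsonr`, the same call used earlier for age. A minimal sketch on synthetic stand-in data (the real call would pass the `TAVE` and `tripdurationLog` columns of `CitiBikeDataSampled`):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
# Synthetic stand-ins mimicking a weak positive relationship between
# average temperature and log trip duration; coefficients are illustrative.
tave = rng.uniform(20, 90, size=2000)
trip_log = 6.0 + 0.01 * tave + rng.normal(0, 0.8, size=2000)

r, p = pearsonr(tave, trip_log)
# With the real data this call returned r of roughly 0.15; here r is
# small and positive by construction.
```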

In [32]:
%%time
cont = sns.jointplot(x=CitiBikeDataSampled.tripdurationLog, y=CitiBikeDataSampled.TAVE, kind='kde')
#cont.plot_joint(plt.scatter, c="w", s=3, linewidth=0.5, marker=".")
Wall time: 10min 34s

Visualize Relationships Between Features and Prediction Class

At the core of our business understanding is wanting to identify the point at which a rider moves from being a customer to a subscriber. Being able to know what qualities or features of the two separate the roles and to what degree would allow us to identify stations and areas that hold the most potential for subscriber enrollment.

Customer vs. Subscriber Trip Duration by Day of the Week Split Violin Plot

Almost universally, across every day of the week, customers appear to have a higher trip duration than subscribers. While additional analysis will be required to confirm this, one possible explanation is that subscribers can freely take and return their bikes, making them more willing to make shorter trips than customers, who pay each time they rent a bike. An alternate explanation, based on what we know about the relationship between trip duration and linear distance traveled, is that subscribers use the bikes for commuting to and from specific locations. This would result in lower trip durations than for customers, who might use their bikes for general travel around the city. This possibility is corroborated by the decrease in subscriber activity on the weekends.

Identifying the point at which a customer might become a subscriber using this data would likely involve monitoring weekday activity and trip duration. If a station has many customers with trip durations similar to those of subscribers, that station would be a good location for focused advertisement of the benefits of subscribing.
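The flagging rule described above could be sketched as follows. This is a toy illustration, not code run against the real data; the station names, durations, and the 20% similarity threshold are all hypothetical:

```python
import pandas as pd

# Toy data standing in for CitiBikeDataSampled; values are illustrative only.
df = pd.DataFrame({
    "start_station_name": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "usertype": ["Customer", "Subscriber"] * 4,
    "tripduration": [600, 580, 620, 560, 2400, 590, 2600, 610],
})

# Median trip duration per station, split by usertype.
med = df.pivot_table(index="start_station_name", columns="usertype",
                     values="tripduration", aggfunc="median")

# Stations where customer medians sit within 20% of subscriber medians
# would be candidates for targeted subscription advertising.
candidates = med[(med["Customer"] / med["Subscriber"]).between(0.8, 1.2)].index.tolist()
print(candidates)  # ['A']
```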

In [33]:
sns.set(style="whitegrid", palette="pastel", color_codes=True)

# Load our subset data set
sub = CitiBikeDataSampled

# Draw a nested violinplot and split the violins for easier comparison
sns.violinplot(x="DayOfWeek", y="tripdurationLog", hue="usertype", data=sub, split=True,
               inner="quart", palette={"Subscriber": "g", "Customer": "y"})
sns.despine(left=True)
Customer vs. Subscriber Linear Trip Distance by Day of the Week Split Violin Plot

Unlike trip duration, the linear distance between start and end stations appears similar for customers and subscribers in both means and quartiles. What is noticeable here, though, is that customers are more widely distributed in how far they ride, with a significant increase in the number of customers who return their bikes to the station they started from.

Further analysis will be necessary to establish the statistical significance of these differences. Even so, it would be possible to identify the stations frequented by subscribers, treat stations within one standard deviation of the subscriber linear distance found below as "subscriber stations," and then see which stations fall outside those zones to further build up the messaging encouraging subscription. Furthermore, identifying those "hot zones" would make it possible to rotate out bikes to increase their longevity.
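The one-standard-deviation zone idea could be sketched like this. A toy illustration, assuming the `usertype` and `LinearDistance` columns used elsewhere in the notebook; the station names and distances are made up:

```python
import pandas as pd

# Toy data; the zone rule (mean +/- 1 std of subscriber linear distance)
# mirrors the idea above, and all numbers are illustrative.
df = pd.DataFrame({
    "start_station_name": ["A", "A", "B", "B", "C", "C"],
    "usertype": ["Subscriber"] * 4 + ["Customer"] * 2,
    "LinearDistance": [1.0, 1.2, 1.1, 0.9, 5.0, 4.8],
})

subs = df[df["usertype"] == "Subscriber"]["LinearDistance"]
lo, hi = subs.mean() - subs.std(), subs.mean() + subs.std()

# Stations whose mean linear distance falls inside the subscriber band
# would be labeled "subscriber stations."
station_means = df.groupby("start_station_name")["LinearDistance"].mean()
subscriber_zone = station_means.between(lo, hi)
```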

In [34]:
sns.set(style="whitegrid", palette="pastel", color_codes=True)

# Load our subset data set
sub = CitiBikeDataSampled

# Draw a nested violinplot and split the violins for easier comparison
sns.violinplot(x="DayOfWeek", y="LinearDistance", hue="usertype", data=sub, split=True,
               inner="quart", palette={"Subscriber": "g", "Customer": "y"})
sns.despine(left=True)
Geographic Heatmap Comparing Customer vs. Subscriber Start Station Activity

After visualizing the overall data set locations with a heatmap over NYC, we took the visualization one step further by breaking the data set into two segments: Customer vs. Subscriber. Below are two separate gmaps heatmaps showing geographic densities for each usertype. What we found supported our theories on customer vs. subscriber usage tendencies. First, the Customer heatmap contains far fewer dense regions overall, which supports our suspicion that customer bikers are less "routine" than subscribing bikers. When looking for the densest region in this heatmap, one point stuck out as particularly interesting: the zoo. The Subscriber gmap does not show the same traffic in that region! This supports the theory that customer bikers use the service more for events, shopping, or one-time convenience. On the Subscriber gmap, the densest region is near Grand Central Station, as discussed earlier, supporting the opposing theory that subscribing members make routine trips to work, groceries, etc., consistently using the bike share service as a means of reaching the metro station.

Customer Users

In [35]:
customerData = CitiBikeDataSampled.query('usertype == "Customer"')
customerLoc = customerData[['start_station_latitude', 'start_station_longitude']].values.tolist()

cmap = gmaps.Map()
customer_layer = gmaps.Heatmap(data=customerLoc)#, fill_color="red", stroke_color="red", scale=3)
cmap.add_layer(customer_layer)
In [36]:
cmap

Subscriber Users

In [37]:
subscriberData = CitiBikeDataSampled.query('usertype == "Subscriber"')
subscriberLoc = subscriberData[['start_station_latitude', 'start_station_longitude']].values.tolist()

smap = gmaps.Map()
subscriber_layer = gmaps.Heatmap(data=subscriberLoc)#, fill_color="green", stroke_color="green", scale=2)
smap.add_layer(subscriber_layer)
In [38]:
smap

Trip Duration and Linear Distance vs Weather by Customer/Subscriber

Because we were able to bring together historical weather data for the dates in our records, we wanted to explore the relationship these variables have with usertype. If subscribers regularly use the bikes for commuting, as we have begun to see, then weather would not impact their rental statistics as much as it would for customers, who appear to be primarily opportunistic in their usage.

A cursory glance reveals a noticeable difference in bike rentals with respect to low temperatures, precipitation, and snowfall. While it is true that there are fewer customers than subscribers, we are concerned primarily with the spread of the plotted points rather than their quantity. On the customer pair plots, there are fewer points across the lower temperature ranges and the higher precipitation/snowfall ranges; the distributions pick back up at higher temperatures and lower precipitation for both usertypes.

If stations consistently see use during "bad" weather, then those stations could be identified as subscriber stations. Further, if certain customers are found making the same trips consistently in all weather types, then they could be pushed for subscription.
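The "bad weather" station test described above could be sketched as a simple filter-and-count. A toy illustration assuming the weather columns merged earlier (`PRCP`, `TAVE`); the precipitation and freezing thresholds are assumptions, not values derived from our analysis:

```python
import pandas as pd

# Toy data; thresholds (PRCP > 0.1 in., TAVE < 32 F) are illustrative choices.
df = pd.DataFrame({
    "start_station_name": ["A", "A", "B", "B", "B"],
    "PRCP": [0.0, 0.5, 0.0, 0.0, 0.6],
    "TAVE": [70, 40, 28, 65, 35],
})

# Trips taken despite precipitation or freezing temperatures.
bad = df[(df["PRCP"] > 0.1) | (df["TAVE"] < 32)]

# Stations with persistent bad-weather use would be flagged as
# likely subscriber stations.
bad_weather_counts = bad["start_station_name"].value_counts()
```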

In [39]:
sns.pairplot(CitiBikeDataSampled.query("usertype == 'Subscriber'"), x_vars=["PRCP","SNOW","TAVE","TMAX","TMIN"], y_vars=["tripduration","tripdurationLog","LinearDistance"])
Out[39]:
<seaborn.axisgrid.PairGrid at 0x27add7b0630>
In [40]:
sns.pairplot(CitiBikeDataSampled.query("usertype == 'Customer'"), x_vars=["PRCP","SNOW","TAVE","TMAX","TMIN"], y_vars=["tripduration","tripdurationLog","LinearDistance"])
Out[40]:
<seaborn.axisgrid.PairGrid at 0x27add79a748>

Features That Could Be Added

Beyond the features we already chose to add to the original data set, there are others of particular interest that would bring much value to the existing data as well. We've documented some of our ideas below:

  • Event/Restaurant/Retail Data: Given that we have detailed geocoordinate data and have already demonstrated powerful use of the Google Maps API, it would be possible to incorporate location details surrounding Citi Bike start and stop locations. There is potential for such data to be gathered automatically using APIs such as Google's. Having this data would provide further insight into why some bike share locations are more popular than others. Such data could even help Citi Bike predict changes in station demand based on changing surroundings and help determine where new stations should be installed.
  • Special Events: Similar to the previous idea, merging other public data based on geophysical coordinates and timeline could introduce other external factors such as the effects of parades, public concerts, festivals, and other events on ride activity for a given day or week. This would help identify/predict abnormal activity in this and future data sets. Additionally, it would provide insight to Citi Bike as to how to better plan and prepare for such events to boost rental count and increase trip duration.
  • GPS Enabled Bike Computers: Though not influenced by the data we have at hand, adding bicycle tracking hardware to each Citi Bike rental would provide substantial value to future data sets. GPS tracking would enable Citi Bike to record the specific routes followed by clients and could even aid NYC planners with transportation projects. Having route information means that true distance covered would be available, an attribute with far more meaning than our LinearDistance attribute. Incorporating GPS tracking with bike speed would also provide insights into individual rider activity. For example, just because a rider's trip duration was 6 hours doesn't mean they actively rode the whole time; it is far more likely such a rider stopped for an extended period at least once during the trip. Adding GPS and speed data would close these gaps.

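For reference, a station-to-station straight-line attribute like LinearDistance is typically derived with the haversine formula. A sketch of one plausible derivation; the document does not specify the exact formula used, and the coordinates below (roughly Grand Central to the Central Park Zoo area) are approximate:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points.
    One plausible way a LinearDistance-style feature could be computed
    from start/end station coordinates."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km = mean Earth radius

# Approximate coordinates: Grand Central to the Central Park Zoo area.
d = haversine_km(40.7527, -73.9772, 40.7678, -73.9718)
```

Whatever the true route length reported by a GPS unit, it would always be at least this straight-line figure, which is why route data would be the more meaningful attribute.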
Exceptional Work

Our team spent a substantial amount of time completing this assignment (50+ total man hours). Some of the features we would consider exceptional work include:

  • The sheer volume of data alone was a challenge. Our original Citi Bike data was split across 8 files, which we had to further split into 13 files due to GitHub file size restrictions. In total, the data set comprised 5562293 records, producing computational resource challenges on our personal computer hardware.
  • We chose to merge historical weather data with the pre-existing Citi Bike data, such that our set was comprised of data sourced from two independent entities. When merging all data, it would often take several hours to load the data alone.
  • We were very thorough in our data quality descriptions to ensure that no improper assumptions were made; this endeavor was far more demanding than we had anticipated and became a bottleneck for our visualization creation and descriptive statistics.
  • Extensive effort was made to generate diversity in our visualizations and to incorporate interactivity where possible. This was a true learning experience as none of our previous classes have required significant use of the Python programming language.
  • Our use of the Google Maps API allowed us to view the data in ways that would have otherwise been impossible given our time constraints. Even with the additional configuration requirements for running the Google API and gmaps module, we managed to render our data over Google Maps and to gather further insights into why some stations portrayed more activity than others.